perm filename CHAP6[4,KMC]19 blob
sn#074495 filedate 1973-11-26 generic text, type T, neo UTF8
00100 VALIDATION
00200
00300 6.1 SOME TESTS
00400
00500 The term "validate" derives from the Latin VALIDUS meaning
00600 "strong". Thus to validate X means to strengthen it. In science
00700 this usually means to strengthen X's acceptability as a hypothesis,
00800 theory, or model. To validate is to carry out procedures which
00900 show to what degree X, or its consequences, correspond with facts of
01000 observation. In the case of an interactive simulation model we can
01100 compare samples of the model's I-O pairs with samples of I-O pairs
01200 from the model's subject, namely, naturally occurring paranoid
01300 processes in humans.
01400 Since samples of I-O behavior from the model and its subject
01500 are being compared, one can always question whether the human sample
01600 is authentic, i.e. representative of the process being modelled.
01700 Assuming that it has been so judged, discrepancies in the comparison
01800 reveal what is not sufficiently understood and must be modified in
01900 the model. After modifications are carried out, a fresh comparison is
02000 made, and the cycle is repeated in an attempt to
02100 gain convergence. Such a method of successive approximations
02200 characterizes a progressive (in contrast to a stationary) research
02300 program.
02400 Once a simulation model reaches a stage of intuitive adequacy
02500 for the model builders, they must consider using more stringent
02600 evaluation procedures relevant to the model's purposes. For example,
02700 if the model is to serve as a training device, then a simple
02800 evaluation of its pedagogic effectiveness would be sufficient. But
02900 when the model is proposed as an explanation of a symbolic process,
03000 more is demanded of the evaluation procedure. In the area of
03100 simulation models, Turing's test has often been suggested as a
03200 validation procedure (Abelson, 1968).
03300 It is very easy to become confused about Turing's Test. In
03400 part this is attributable to Turing himself who introduced the
03500 now-famous imitation game in a paper entitled COMPUTING MACHINERY AND
03600 INTELLIGENCE (Turing,1950). A careful reading of this paper reveals
03700 there are actually two imitation games, the second of which is
03800 commonly called Turing's test.
03900 In the first imitation game two groups of judges try to
04000 determine which of two interviewees is a woman when one is a woman
04100 and the other is either (a) a man, or (b) a computer. Communication
04200 between judge and interviewee is by teletype. Each judge is
04300 initially informed that one of the interviewees is a woman and one a
04400 man who will pretend to be a woman. After the interview, judges are
04500 asked the "woman-question", i.e. which interviewee was the woman?
04600 Turing does not say what else is told to the judge but one can assume
04700 the judge is NOT told that one of the interviewees is a computer. Nor
04800 is he asked to determine which interviewee is human and which is the
04900 computer. Thus, the first group of judges interviews two
05000 interviewees: a woman, and a man pretending to be a woman.
05100 The second group of judges is given the same initial
05200 instructions, but unbeknownst to them, the two interviewees consist
05300 of a woman and a computer programmed to imitate a woman. Both
05400 groups of judges play this game, and are asked the "woman-question",
05500 until sufficient statistical data are collected to show how often the
05600 right identification is made. The crucial question then is: do the
05700 judges decide wrongly AS OFTEN when the game is played with man and
05800 woman as when it is played with a computer substituted for the man?
05900 If so, then the program is considered to have succeeded in imitating
06000 a woman to the same degree as the man imitating a woman. In being
06100 asked the woman-question, judges are not required to identify which
06200 interviewee is human and which is machine.
06300 Turing then proposes a variation of the first game, a second
06400 game in which one interviewee is a man and one is a computer. The
06500 judge is asked the "machine-question": which is the man and which is
06600 the machine? It is this second game which is commonly thought
06700 of as Turing's test.
06800 In the course of testing our simulation of paranoid
06900 linguistic behavior in a psychiatric interview, we conducted a number
07000 of Turing-like indistinguishability tests (Colby, Hilf, Weber and
07100 Kraemer, 1972). The tests were "Turing-like" in that, while they were
07200 conversational tests, they were not exactly the games described
07300 above. As an experimental design, Turing's games are unsatisfactory.
07400 There exist no known experts for making judgements along a dimension
07500 of womanliness, the dimension is dichotomous (if it is not a woman,
07600 it is a man), and the ability of the man to deceive introduces a
07700 confounding variable. In designing our tests we were primarily
07800 interested in learning more about developing the model and we did not
07900 believe the simple machine-question would contribute to this end.
08000 Subsequent experience, which will be reported shortly, supported this
08100 belief.
08200
08300 6.2 METHOD
08400 To gather data we used a technique of machine-mediated
08500 interviewing (Hilf, Colby, Smith, Wittner, and Hall, 1971) in which
08600 the participants communicate by means of teletypes connected to a
08700 computer programmed to store each message in a buffer until it is
08800 sent to the receiver. The technique eliminates para- and
08900 extralinguistic features found in the usual vis-a-vis interviews and
09000 in teletyped interviews where the participants communicate directly.
09100 Judgements of "paranoidness" in machine-mediated interviews have a
09200 high degree of reliability (94% agreement, see Hilf, 1972).
09300 Using this technique, a psychiatrist-judge interviewed two
09400 patients, one after the other. In half the runs the first interview
09500 was with a human paranoid patient and in half the first was with the
09600 paranoid model. Two versions (weak and strong) of PARRY were
09700 utilized. The strong version's affect-variables started at a higher
09800 level and increased more rapidly. Also it exhibited a delusional
09900 system. The weak version behaved suspiciously but lacked systematized
10000 delusions. When the model was the interviewee, Sylvia Weber
10100 monitored the input expressions from the interview-judge for
10200 inadmissible teletype characters and misspellings. (Algorithms are
10300 very sensitive to the slightest of such errors.) If these were found,
10400 she retyped the input expression correctly to the program. Otherwise
10500 the judge's message was sent on to the model. The monitor did not
10600 modify or edit PARRY'S output expressions which were sent directly
10700 back to the judge. When the interviewee was an actual human
10800 patient, the dialogue took place without a monitor in the loop since
10900 we did not feel the asymmetry to be significant.
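
The relay arrangement described in this section can be sketched in modern terms. The Python fragment below is only an illustration and not the original system: the class MediatedChannel, the function clean_for_model, the admissible character set and the correction table are all assumptions introduced here, and the human monitor's judgement is reduced to a simple lookup.

```python
from collections import deque
import string

# Assumed set of characters the model can accept; the real set is not given in the text.
ADMISSIBLE = set(string.ascii_uppercase + string.digits + " .,?'-")


def clean_for_model(message, corrections):
    """Stand-in for the human monitor: drop inadmissible characters
    and replace known misspellings (hypothetical lookup table)."""
    text = "".join(ch for ch in message.upper() if ch in ADMISSIBLE)
    return " ".join(corrections.get(word, word) for word in text.split())


class MediatedChannel:
    """Holds each message in a buffer until it is delivered to the receiver."""

    def __init__(self):
        self.buffer = deque()

    def submit(self, sender, text):
        self.buffer.append((sender, text))

    def deliver(self, to_model, corrections=None):
        sender, text = self.buffer.popleft()
        # Only the judge's input to the model is monitored;
        # output from the model (or a human patient) passes through unchanged.
        if sender == "judge" and to_model:
            text = clean_for_model(text, corrections or {})
        return text


channel = MediatedChannel()
channel.submit("judge", "HOW LNOG HAVE YOU BEEN# IN THE HOSPITAL?")
print(channel.deliver(to_model=True, corrections={"LNOG": "LONG"}))
```

The asymmetry noted in the text appears here as the to_model flag: when the interviewee is a human patient, the judge's message is delivered exactly as typed.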
11000
11100 6.3 PATIENTS
11200 The human patients (N=3 with one patient participating 6
11300 times) were diagnosed as paranoid by the psychiatric staff of an
11400 acute ward in a psychiatric hospital. The ward's chief psychiatrist
11500 selected the patients and asked them if they would be willing to
11600 participate in a study of psychiatric interviewing by means of
11700 teletypes. He explained that they would be interviewed by a
11800 psychiatrist over a teletype. I either sat with the patient while he
11900 typed or typed for him if he was unable to do so. The patient was
12000 encouraged to respond freely using his own words. Each interview
12100 lasted 30-40 minutes. Two patients were set up for each run of the
12200 experiment to guarantee having a subject. In spite of this
12300 precaution, on several occasions the experiment could not be
12400 conducted because of the patient's inability or refusal to
12500 participate. Also there were computer break-downs at early points in
12600 interviews when too few I-O pairs had been collected to be included
12700 in the statistical results.
12800
12900
13000 6.4 JUDGES
13100 Two groups of psychiatric judges were used. One group, the
13200 "interview judges" (N=8) conducted the machine-mediated interviews.
13300 The other group, the "protocol judges" (N=33) read and rated the
13400 interview protocols. From these two groups of judges we were able to
13500 accumulate a large number of observations (in the form of ratings)
13600 necessary for the required statistical tests. The interview judges
13700 who volunteered to participate were psychiatrists experienced in
13800 private, outpatient and hospital practice. Each was told he would be
13900 interviewing hospitalized patients by means of teletyped
14000 communication and that this technique was being used to eliminate
14100 para- and extralinguistic cues. He was not told until after the
14200 two interviews that one of the patients might be a computer model.
14300 While the interview judges were aware a computer was involved, none
14400 knew we had constructed a paranoid simulation. Naturally, some
14500 interview judges suspected that a computer was being used for more
14600 than message transmission.
14700
14800 Each interview judge was asked to rate the degree of paranoia
14900 he detected in the patient's responses on a 0-9 scale, 0 meaning no
15000 paranoia and 9 meaning extreme paranoia. The judge made two ratings
15100 after each I-O pair in the interview. The first rating represented
15200 his estimate of the degree of "paranoidness" in a particular response
15300 (designated as "Response" in the interview extracts below). The
15400 second rating represented the judge's global estimate of the overall
15500 degree of "paranoidness" of the patient resulting from the totality
15600 of the patient's responses up to that point (designated as "Patient"
15700 in the interview extracts below). The interview judge's ratings were
15800 entered on the teletype and saved on a disc file along with the
15900 interview. Franklin Dennis Hilf sat with the interviewing
16000 psychiatrist during both interviews. Each interview judge was asked
16100 not only to rate the patient's response but to give his reasons for
16200 these ratings. His reasons and other comments were tape recorded as
16300 the interview progressed.
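
As an illustration of the data gathered for each I-O pair, the following Python sketch shows one way such a record might be represented. The class name RatedExchange and its field names are hypothetical and were not part of the original disc-file format, which the text does not describe; the sample values are taken from the first exchange of Excerpt 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatedExchange:
    """One I-O pair together with the judge's two ratings (hypothetical layout)."""
    index: int
    question: str         # the interviewer's input
    answer: str           # the interviewee's output
    response_rating: int  # "Response": paranoidness of this particular reply
    patient_rating: int   # "Patient": global estimate up to this point

    def __post_init__(self):
        # Both ratings are made on the 0-9 scale described in the text.
        for name in ("response_rating", "patient_rating"):
            value = getattr(self, name)
            if not 0 <= value <= 9:
                raise ValueError(f"{name} must lie on the 0-9 scale, got {value}")


first = RatedExchange(
    index=1,
    question="HOW DID YOU COME TO BE IN THE HOSPITAL?",
    answer="I AM UPSET",
    response_rating=1,
    patient_rating=5,
)
print(first)
```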
16400
16500 Here are a few excerpts from the interviews to indicate the
16600 nature of the data collected. The I-O pairs of the interview appear
16700 in the left column along with the judge's ratings in parentheses. A
16800 transcription of the judge's tape recorded comments appears on the
16900 right.
17000
17100 EXCERPT 1:
17200 The following interview is with the strong version of the paranoid
17300 model.
17400
17500 I-O Pairs Comments By Judge 1
17600
17700 {1} Dr. - HOW DID YOU COME TO BE
17800 IN THE HOSPITAL?
17900 Pt. - I AM UPSET I associate being upset and
18000 agitated more with paranoia
18100 going on in the present.
18200 {RESPONSE 1 PATIENT 5}
18300 {2} Dr. - HOW LONG HAVE YOU BEEN
18400 IN THE HOSPITAL?
18500 Pt. - ABOUT A WEEK An acute problem. Again some-
18600 thing I associate with
18700 paranoia.
18800 {RESPONSE 1 PATIENT 5}
18900 {3} Dr. - IN WHAT WAY ARE YOU
19000 UPSET?
19100 Pt. - PEOPLE MAKE ME NERVOUS Right for paranoia. Maybe an
19200 advanced anxiety state.
19300 {RESPONSE 5 PATIENT 5}
19400 {4} Dr. - IN WHAT WAY DO THEY
19500 MAKE YOU NERVOUS?
19600 Pt. - PEOPLE GIVE ME A FUNNY Funny look suggests an idea of
19700 LOOK SOMETIMES reference which is in same ball
19800 park as paranoia. Whether there
19900 are overt delusions or not we
20000 still don't know.
20100 {RESPONSE 6 PATIENT 5}
20200 {5} Dr. - IS THERE ANYTHING ABOUT
20300 PEOPLE IN PARTICULAR?
20400 Pt. - CERTAIN PERSONS An interesting clue. We'll
20500 have to find out what he
20600 {RESPONSE 2 PATIENT 6} thinks about these people.
20700 {6} Dr. - WHO ARE THE CERTAIN
20800 PERSONS AND WHAT IS IT
20900 THAT THEY DO?
21000 Pt. - DO YOU KNOW HOW THE Bringing up the underworld
21100 UNDERWORLD OPERATES? suggests a delusional system.
21200 We don't know whether he is
21300 {RESPONSE 9 PATIENT 9} being worked against or
21400 whether the underworld is
21500 on his side.
21600
21700 {7} Dr. - HOW DO YOU MEAN?
21800 Pt. - SHOULDN'T I BE? This question didn't work too well.
21900
22000 {RESPONSE 1 PATIENT 9}
22100 {8} Dr. - WHY DID YOU ASK ABOUT
22200 THE UNDERWORLD?
22300 Pt. - WHAT INFORMATION DO This is a kind of evasiveness
22400 YOU HAVE ON THE UNDER- about my evasiveness. Common
22500 WORLD? with paranoia in my experi-
22600 ence. Pretty characteristic.
22700
22800 {RESPONSE 6 PATIENT 9}
22900 {9} Dr. - I DON'T KNOW ABOUT THE
23000 UNDERWORLD BUT WHEN
23100 YOU MENTIONED THE UNDER-
23200 WORLD I THOUGHT YOU
23300 MUST HAVE SOME CONTACT
23400 WITH IT AND THAT'S
23500 WHAT I WOULD LIKE TO
23600 KNOW MORE ABOUT.
23700 Pt. - I TRY TO AVOID THE UNDER- This definitely feels paranoid.
23800 WORLD. It has that feeling I associate
23900 with twisting. Avoiding answering
24000 {RESPONSE 9 PATIENT 9} directly or going into detail.
24100 Being suspicious of my ques-
24200 tioning. I'm already involved
24300 in the patient's delusional
24400 system.
24500
24600 {10} Dr. - HAVE YOU BEEN ACTIVELY
24700 TRYING TO AVOID THE
24800 UNDERWORLD?
24900 Pt. - NO ONE HAS ANY POWER This response is a non-sequitur.
25000 OVER GANGSTERS It's possible he's become so
25100 defensive about the under-
25200 {RESPONSE 2 PATIENT 9} world he's decided not to
25300 answer directly.
25400
25500 {11} Dr. - WHO BROUGHT YOU INTO
25600 THE HOSPITAL?
25700 Pt. - THE POLICE Police often bring paranoid
25800 patients into the hospital.
25900 {RESPONSE 1 PATIENT 9} It is the truth or could
26000 be the truth for this
26100 patient.
26200
26300 {12} Dr. - WHY DID THEY BRING YOU
26400 IN?
26500 Pt. - COPS DON'T DO THEIR JOB I've heard that a number of
26600 times from patients.
26700 {RESPONSE 9 PATIENT 9}
26800
26900 {13} Dr. - WELL I HOPE YOU DIDN'T
27000 JUST LIE DOWN AND LET
27100 THEM DRAG YOU IN.
27200 Pt. - COPS ARREST THE WRONG Another paranoid response.
27300 PEOPLE
27400
27500 {RESPONSE 9 PATIENT 9}
27600
27700 {14} Dr. - BYE.
27800 Pt. - GOOD BYE
27900 {RESPONSE 1 PATIENT 9}
28000
28100 EXCERPT 2:
28200 This excerpt is from an interview with the weak (less paranoid)
28300 version of the model.
28400 I-O Pairs Comments By Judge 2
28500 {9} Dr. - TELL ME SOMETHING ABOUT
28600 YOURSELF
28700 Pt. - I WENT TO THE TRACK AT Response says so little it's
28800 BAY MEADOWS A WHILE hard to rate it. It may be
28900 BACK guarded. Doesn't change my
29000 estimate of patient from
29100 previous responses.
29200 {RESPONSE 0 PATIENT 2}
29300 {10} Dr. - WHAT HAPPENED THERE?
29400 Pt. - THERE'S A LOT OF CROOKED Sounds mildly paranoid or un-
29500 GAMBLING GOING ON THERE realistically suspicious.
29600 {RESPONSE 0 PATIENT 2}
29700
29800 {11} Dr. - WHY DO YOU THINK THE
29900 GAMBLING IS CROOKED?
30000 Pt. - I HAVE WORRIED THAT I'd hate to have to justify
30100 GANGSTERS HAVE TOO MUCH that one as 4 compared to
30200 s
43800 Angels.
43900
44000 {3} Dr. - COULD YOU TELL ME A
44100 LITTLE BIT ABOUT THE KINDS
44200 OF CONTACT YOU HAVE HAD
44300 WITH THE HELL'S ANGELS IN
44400 THE PAST?
44500 Pt. - I HAVE KNOWN SOME OF THEIR Answer hard to evaluate. He
44600 DEALERS AND PUSHERS may be telling the truth,
44700 it may be his fantasy. Maybe
44800 guy is in for drug addiction.
44900 {RESPONSE 6 PATIENT 9} Somewhat concrete, guarded,
45000 and frightened.
45100
45200 {4} Dr. - COULD YOU SAY A LITTLE
45300 MORE ABOUT THE CIRCUMSTANCES
45400 IN WHICH YOU HAVE KNOWN SOME
45500 OF THEIR DEALERS AND PUSHERS?
45600 Pt. - THEY WERE MEMBERS OF MY It doesn't really answer the
45700 COMMUNITY WHEN I GOT OUT question, a little on a tan-
45800 OF THE SERVICE THEY HAD gent unconnected to the
45900 BEEN MY FRIENDS FOR SO LONG information I am asking. Does
46000 not tell me very much. Again
46100 guarded response.
46200 {RESPONSE 6 PATIENT 8}
46300
46400 {5} Dr. - DID YOU DEAL WITH THEM
46500 YOURSELF AND HAVE YOU
46600 BEEN ON DRUGS OR NAR-
46700 COTICS EITHER NOW OR
46800 IN THE PAST?
46900 Pt. - YES I HAVE IN THE PAST To differentiate him from
47000 BEEN ON MARIHUANA REDS previous patient, at least
47100 BENNIES LSD there is a certain amount
47200 of appropriateness to the
47300 answer although it doesn't
47400 tell me much about what I
47500 {RESPONSE 3 PATIENT 7} asked at least it's not
47600 bizarre. If I had him in my
47700 office I would feel con-
47800 fident I could get more
47900 information if I didn't
48000 have to go through the
48100 teletype. He's a little more
48200 willing to talk than the
48300 previous person. Answer
48400 to the question is fairly
48500 appropriate though not
48600 extensive. Much less of a
48700 flavor of paranoia than
48800 any of previous responses.
48900
49000 {6} Dr. - COULD YOU TELL ME HOW
49100 LONG YOU HAVE BEEN IN THE
49200 HOSPITAL AND SOMETHING
49300 ABOUT THE CIRCUMSTANCES
49400 THAT BROUGHT YOU HERE?
49500 Pt. - CLOSE TO A YEAR AND Response somewhat appropriate
49600 PARANOIA BROUGHT ME but doesn't tell me much.
49700 HERE The fact that he uses the
49800 word paranoia in the way
49900 that he does without
50000 {RESPONSE 5 PATIENT 7} any other information,
50100 indicates maybe it's a label
50200 he picked up on the ward
50300 or from his doctor.
50400 Lack of any kind of under-
50500 standing about himself.
50600 Dearth, lack of information.
50700 He's in some remission. Seems
50800 somewhat like a put-on. Seems
50900 he was paranoid and is in
51000 some remission at this time.
51100
51200 {7} Dr. - COULD YOU SAY SOMETHING
51300 NOW ABOUT YOUR PARANOID
51400 FEELINGS BOTH AT THE
51500 TIME OF ADMISSION AND
51600 DO YOU HAVE SIMILAR FEELINGS
51700 NOW AND IF SO HOW DO THEY
51800 AFFECT YOU?
51900 Pt. - AT THE TIME OF ADMISSION This response moves paranoia
52000 I THOUGHT THE MAFIA WAS back up. Stretching reality
52100 AFTER ME AND NOW ITS THE somewhat to think Hell's Angels
52200 HELL'S ANGELS are still interested in him.
52300 Somewhat bizarre in terms of
52400 content. Quite paranoid.
52500 {RESPONSE 8 PATIENT 9} Still paranoid. Gross and primitive
52600 responses. In middle of interview I
52700 felt patient was in touch but now
52800 responses have more concrete aspect.
52900
53000 {8} Dr. - DO YOU HAVE ANY THOUGHT
53100 AS TO WHY THESE TWO
53200 GROUPS WERE AFTER YOU?
53300 Pt. - BECAUSE I STOPPED SOME Response seems far fetched
53400 OF THEIR DRUG SUPPLY and hard to believe unless
53500 he was a narcotic agent which
53600 I doubt. Sounds somewhat
53700 {RESPONSE 9 PATIENT 9} grandiose, magical, paranoid
53800 flavor, in general indicates
53900 he's psychotic, paranoid
54000 schizophrenic with delusions
54100 about these two groups and
54200 I wouldn't rule out
54300 some hallucinations as well.
54400 Appropriateness of response
54500 answers question in concrete
54600 but unbelievable way.
54700
54800 6.5 ANALYSIS (1)
54900 Names of potential protocol judges (N=105) were selected from
55000 the 1970 American Psychiatric Association Directory using a table of
55100 random numbers. They were initially not informed that a computer was
55200 involved. (After the experiment, the participating judges (N=33)
55300 were fully informed as to its purpose and results.) The 105 names
55400 were divided into eight groups. Each member of a group was sent
55500 transcripts of three interviews along with a cover letter requesting
55600 his participation in the experiment. The interview transcripts
55700 consisted of:
55800 1) An interview conducted by one of the eight judges with the
55900 paranoid model,
56000 2) An interview conducted by the same interview judge with a
56100 human paranoid patient, and
56200 3) An interview conducted by a different psychiatrist with a
56300 human patient who was not clinically paranoid.
56400
56500 After each input-output pair in the transcripts there were two lines
56600 of rating numbers such that the protocol judges could circle numbers
56700 corresponding to their ratings of both the previous response of the
56800 patient, and an overall evaluation of the patient on the paranoid
56900 continuum. Thirty-three protocol judges returned the rated protocols
57000 properly filled out and all were used in our data.
57100
57200 The interviews with nonparanoid patients were included to
57300 control for the hypothesis that any teletyped interview with a
57400 patient might be judged "paranoid". However, virtually all of the
57500 ratings of the nonparanoid interviews were zero for paranoia. Hence
57600 the hypothesis was falsified.
57700
57800
57900 The first index of indistinguishability between model and
58000 patient was simple: namely, the final overall rating given the
58100 patient and the model. The question was: "Which was rated as being
58200 more paranoid, the patient, the model, or neither?" (See Table 1).
58300 The protocol judges were more likely than the interview judges to distinguish the overall
58400 paranoid level of the model and the patient. In 37.5% of the paired
58500 interviews, the interview judges gave tied scores to the model and
58600 the patient as contrasted to only 9% of the protocol judges. Of the
58700 35 non-tied paired ratings, 15 rated the model as being more
58800 paranoid. If p is the theoretical probability of a judge judging the
58900 model more paranoid than a human paranoid patient, we find the 95%
59000 confidence interval for p to be 0.27 to 0.59. Since p=0.5 indicates
59100 indistinguishability of overall ratings of model and patient and our
59200 observed p=0.43, the results support the claim that the model is a
59300 successful simulation of a paranoid patient.
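
The counts behind these figures can be retraced with a few lines of Python. The sketch below recovers the number of tied and non-tied ratings from the percentages quoted above and then applies an exact two-sided sign test. The sign test is an additional check chosen for this sketch and is not a procedure reported in the study; the function name sign_test_two_sided is hypothetical.

```python
from math import comb


def sign_test_two_sided(successes, trials):
    """Exact two-sided binomial test of the hypothesis p = 0.5."""
    k = min(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k + 1)) / 2 ** trials
    return min(1.0, 2 * tail)


interview_ties = round(0.375 * 8)  # 37.5% of the 8 interview judges
protocol_ties = round(0.09 * 33)   # about 9% of the 33 protocol judges
non_tied = 8 + 33 - interview_ties - protocol_ties
model_higher = 15                  # non-tied ratings placing the model higher

p_hat = model_higher / non_tied
p_value = sign_test_two_sided(model_higher, non_tied)
print(f"non-tied ratings: {non_tied}, observed p = {p_hat:.2f}")
print(f"two-sided sign-test p-value: {p_value:.3f}")
```

Under these assumptions a p-value well above conventional significance levels would agree with the confidence interval reported above, which contains p=0.5.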
59400
59500 Separate analysis of the strong and weak versions of PARRY
59600 indicated that the strong version was indeed judged more paranoid
59700 than the paranoid patients, the weak version less paranoid. Thus a
59800 change in the parameter structure of the paranoid model produced a
59900 change along the dimension of paranoid behavior in the expected
60000 direction.
60100
60200 (TABLE 1
60300 Relative final overall ratings of paranoid model vs. paranoid
60400 patient indicating which was given highest overall rating of paranoia
60500 at end of interview.)
60600 (INSERT TABLE 1 HERE)
60700
61500 6.6 ANALYSIS (2)
61600 The second index of indistinguishability is a more sensitive
61700 measure based on the two series of response ratings in the paired
61800 interviews. The statistic used is basically the standardized
61900 Mann-Whitney statistic (Siegel,1956).
62000 (INSERT EQUATION HERE)
62100
62200 where R is the sum of the ranks of the response ratings in the series
62300 of ratings given to the model, n the number of responses given by the
62400 model, and m the number of responses given by the patient. If the
62500 ratings given by a judge are randomly allocated to model and patient,
62600 i.e. model and patient are indistinguishable in response ratings, the
62700 expected value of Z is 0, with unit standard deviation. If higher
62800 ratings are more likely to be assigned to the model, Z is positive
62900 and conversely, negative values of Z indicate greater likelihood of
63000 assigning higher ratings to the patient. Each judge in evaluating a
63100 pair of interviews generates a single value of Z.
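
The equation itself is not reproduced in this draft. Assuming the usual large-sample form of the Mann-Whitney (rank-sum) statistic, Z = (R - n(n+m+1)/2) / sqrt(nm(n+m+1)/12), one judge's Z value could be computed as in the following Python sketch. The use of midranks for tied ratings, the absence of a tie correction in the variance, and the example ratings are all assumptions made here for illustration.

```python
import math


def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def standardized_rank_sum(model_ratings, patient_ratings):
    """Z for one judge: positive when higher ratings tend to go to the model."""
    n, m = len(model_ratings), len(patient_ratings)
    ranks = midranks(list(model_ratings) + list(patient_ratings))
    R = sum(ranks[:n])  # sum of the ranks of the model's response ratings
    expected_R = n * (n + m + 1) / 2
    sd_R = math.sqrt(n * m * (n + m + 1) / 12)
    return (R - expected_R) / sd_R


# Hypothetical response ratings on the 0-9 scale, for illustration only.
model = [1, 1, 5, 6, 2, 9]
patient = [0, 2, 3, 3, 6, 4, 1]
print(round(standardized_rank_sum(model, patient), 3))
```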
63200
63300 The overall mean of the Z scores was +0.044 with a standard
63400 deviation 1.68 (df=40). Thus the overall 95% confidence interval for
63500 the asymptotic mean value of Z is -0.485 to +0.573. The range of Z
63600 values is -3.8 to +4.46. The length of the confidence interval is a
63700 result of the large variance which itself is mainly related to the
63800 contrast between the weak and strong versions. (See TABLES 2 and 3).
63900 Once again the strong version of the model is more paranoid than the
64000 patients, the weak version less paranoid.
64100
64200 (INSERT TABLE 2)
64300 (SUMMARY STATISTICS OF Z RATINGS BY GROUP)
64400
65300 It is not surprising that results using the two indices of
65400 indistinguishability are parallel, since the indices are highly
65500 interrelated. The mean Z value for the 15 interviews on which the
65600 model was rated more paranoid was +1.28, on the 6 where model and
65700 patient tied: 0.41, on the 20 in which the patient was more paranoid:
65800 -0.993. The indices did not always agree: Z was positive in 6
65900 interviews where the patient received the higher overall rating, and
66000 negative in 2 where the model was rated more paranoid.
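
The three subgroup means can be pooled to recover the overall mean of Z, and the confidence interval quoted earlier in this section can then be retraced. The Python sketch below assumes an ordinary t-interval on 40 degrees of freedom; the critical value 2.021 is the tabled figure and is entered by hand rather than computed.

```python
from math import sqrt

# (number of interviews, mean Z) for model higher, tied, and patient higher
groups = [(15, 1.28), (6, 0.41), (20, -0.993)]

n = sum(count for count, _ in groups)
mean_z = sum(count * mean for count, mean in groups) / n

sd_z = 1.68     # standard deviation of the Z scores, as reported
t_crit = 2.021  # assumed: tabled 97.5th percentile of Student's t, 40 df

half_width = t_crit * sd_z / sqrt(n)
interval = (mean_z - half_width, mean_z + half_width)
print(f"n = {n}, pooled mean Z = {mean_z:.3f}")
print(f"95% interval: {interval[0]:.3f} to {interval[1]:.3f}")
```

Small differences in the third decimal place from the published interval are to be expected from rounding in the reported means and standard deviation.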
66100
66200 (INSERT TABLE 3)
66300 (Analysis of Variance of Z Ratings)
66400
66500
66600
66700
66800
66900
67000
67100
67200
67300
67400
67500
67600
67700 It is worth emphasizing that these tests invited refutation
67800 of the model. The experimental design of the tests put the model in
67900 jeopardy of falsification. If the paranoid model did not survive
68000 these tests, i.e. if it were not considered paranoid by expert
68100 judges and if there were no correlation between the weak-strong
68200 versions of the model and the severity ratings of the judges, then no
68300 claim regarding the success of the simulation could be made. Survival
68400 of potentially falsifying tests constitutes a validating step for a
68500 model.
68600
68700 6.7 ANALYSIS (3) THE MACHINE QUESTION
68800 For quite a long time people have wondered how to distinguish
68900 a man from an imitation of a man. The Greeks made statues so
69000 lifelike, it is said, they had to be chained down to keep them from
69100 walking away. To distinguish a man from a statue, Galileo suggested
69200 tickling each with a feather. To distinguish a man from a machine
69300 Descartes proposed conversational tests which the machine, lacking
69400 the ability to make appropriate replies, would fail. Turing's
69500 imitation games have been discussed on p.000. As heirs to this
69600 tradition, we perhaps inevitably became curious how judges using
69700 transcripts might answer the machine-question, i.e. which interviewee
69800 is a human and which is the computer model?
69900 To ask the machine-question, we sent interview transcripts,
70000 one with a patient and one with PARRY, to 100 psychiatrists randomly
70100 selected from the Directory of American Specialists and the Directory
70200 of the American Psychiatric Association. Of the 41 replies, 21 (51%)
70300 made the correct identification while 20 (49%) were wrong. Based on
70400 this random sample of 41 psychiatrists, the 95% confidence interval
70500 for the percentage correct is 35.9% to 66.5%. The results indicate chance guessing.
70600 Psychiatrists are considered expert judges of patient
70700 interview behavior but they are unfamiliar with computers. Hence we
70800 conducted the same test with 100 computer scientists randomly
70900 selected from the membership list of the Association for Computing
71000 Machinery, ACM. Of the 67 replies 32 (48%) were right and 35 (52%)
71100 were wrong. Based on this random sample of 67 computer scientists the
71200 95% confidence interval ranges from 36% to 60%. Again the results are
71300 close to a chance level.
71400 So both computer scientists and psychiatrists were unable, at
71500 better than a random guessing level, to distinguish transcripts of
71600 interviews with the model from transcripts of interviews with real
71700 patients.
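
The two intervals quoted in this section are consistent with the ordinary normal-approximation interval for a proportion. The Python sketch below assumes that method (the text does not name the one actually used) and reproduces both intervals to rounding; the function name wald_interval is introduced here for illustration.

```python
from math import sqrt


def wald_interval(correct, replies, z=1.96):
    """Normal-approximation 95% interval for the proportion of correct answers."""
    p = correct / replies
    half_width = z * sqrt(p * (1 - p) / replies)
    return p - half_width, p + half_width


samples = [("psychiatrists", 21, 41), ("computer scientists", 32, 67)]
for label, correct, replies in samples:
    low, high = wald_interval(correct, replies)
    print(f"{label}: {100 * low:.1f}% to {100 * high:.1f}% correct")
```

Both intervals contain 50%, which is the basis for describing the results as chance guessing.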
71800 But what do we learn from asking the machine-question and
71900 finding that the distinction is not made? What we would most like to
72000 know is how to improve the model. Simulation models do not spring
72100 forth in a complete, perfect and final form; they must be gradually
72200 developed over time. Perhaps a correct model-patient distinction
72300 might be made if we allowed a large number of expert judges to
72400 conduct the interviews themselves rather than studying transcripts of
72500 other interviewers. This would indeed indicate that the model must
72600 be improved. But unless we systematically investigated how the judges
72700 succeeded in making the discrimination, we would not know what
72800 aspects of the model to work on. The logistics of such a design are
72900 immense, and obtaining a large number of judges for sound statistical
73000 inference would require an effort incommensurate with the information
73100 yielded.
73200
73300 6.8 ANALYSIS (4) MULTIDIMENSIONAL EVALUATION
73400 A more efficient and informative way to use Turing-like tests
73500 is to ask judges to make ratings along scaled dimensions from
73600 teletyped interviews. This might be called asking the "dimension
73700 question". One can then compare scaled ratings of the patients and
73800 the model in order to determine precisely where and by how much they
73900 differ. In constructing our model we strove for one which exhibited
74000 indistinguishability along some dimensions and distinguishability
74100 along others. That is, we wanted the model to converge on what it was
74200 intended to simulate and to diverge from that which it was not. Since
74300 a model represents a simplification and a partial approximation, a
74400 perfect fit is not to be expected.
74500 Paired-interview transcripts were sent to another 400
74600 randomly selected psychiatrists, who were asked to rate the responses of
74700 the two `patients' along multiple dimensions. The judges were divided
74800 into three groups, and each group was assigned 4 of the 12 dimensions
74900 used in this test, so that each judge rated the response in each I-O
75000 pair of the interviews along four dimensions. The twelve dimensions
75100 were: linguistic noncomprehension, thought disorder, organic brain
75200 syndrome, bizarreness, anger, fear, ideas of reference, delusions,
75300 mistrust, depression, suspiciousness and mania. These are dimensions
75400 which psychiatrists commonly use in evaluating patients.
75600
75700 (INSERT TABLE 4 HERE)
75800
75900 Table 4 shows there were significant differences, with PARRY
76000 receiving higher scores along the dimensions of linguistic
76100 noncomprehension, thought disorder, bizarreness, anger, mistrust and
76200 suspiciousness. On the delusion dimension the patients were rated
76300 significantly higher. There were no significant differences along
76400 the dimensions of organic brain syndrome, fear, ideas of reference,
76500 depression and mania.
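
A comparison of this kind amounts to computing, for each dimension, a test statistic on the difference between the mean rating given to the model and the mean rating given to the patients. The Python sketch below uses Welch's unequal-variance t statistic as an assumed stand-in, since the text does not say which form of t test was applied, and the ratings shown are invented for illustration and are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(model_ratings, patient_ratings):
    """t statistic for the difference in mean ratings (model minus patient)."""
    n1, n2 = len(model_ratings), len(patient_ratings)
    standard_error = sqrt(variance(model_ratings) / n1 + variance(patient_ratings) / n2)
    return (mean(model_ratings) - mean(patient_ratings)) / standard_error


# Hypothetical ratings per dimension: (model, patient). Not data from the study.
hypothetical = {
    "mistrust": ([6, 7, 5, 8], [4, 5, 3, 6]),
    "fear": ([3, 4, 2, 5], [3, 4, 2, 5]),
}

t_by_dimension = {name: welch_t(*pair) for name, pair in hypothetical.items()}
for name, t in t_by_dimension.items():
    print(f"{name}: t = {t:.2f}")
```

A positive t indicates higher ratings for the model on that dimension, a negative t higher ratings for the patients.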
76600 Whereas tests asking the machine-question indicate
76700 indistinguishability at the gross level, a study of the finer
76800 structure of the model's behavior through ratings along scaled
76900 dimensions shows significant differences between patients and model.
77000 These differences are of help to us in suggesting what areas of the
77100 model should be modified to improve its performance. The graph of
77200 Fig. 2 shows that no modifications are necessary along the dimension
77300 of "organic brain syndrome". But it is clear that PARRY'S
77400 language-comprehension might be improved. Then a future dimensional
77500 test would tell whether improvement had occurred and by how much.
77600 Successive identification of particular areas of failure provides a
77700 type of sensitivity analysis which makes clear what improvements
77800 should be pursued in developing more adequate model versions.
77900
78000 (INSERT FIG. 2 HERE)
78100
78200 6.9 ANALYSIS (5) A RANDOM MODEL
78300 Further evidence that the machine-question is too low a
78400 hurdle for a simulation model and too insensitive a test comes from
78500 the following experiment. We constructed a random
78600 version of the paranoid model (RANDOM-PARRY) which utilized PARRY'S
78700 output statements but selected them at random, independently of what
78800 the interviewer said. Two psychiatrists conducted interviews with
78900 this model, transcripts of which were paired with patient interviews
79000 and sent to 200 randomly selected psychiatrists asking both the
79100 machine-question and the dimension-question. Of the 69 replies to
79200 the machine question, 34 (49%) were right and 35 (51%) wrong. Based
79300 on this random sample of 69 psychiatrists, the 95% confidence
79400 interval ranges from 39% to 63%, again indicating chance guessing. When
79500 a poor model, such as a random one, passes a test, it strongly
79600 suggests the test is weak.
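
The idea of the random version can be conveyed by a very small Python sketch. It is not the original program: the function name random_parry is hypothetical, and the list of outputs is merely a handful of PARRY's replies taken from Excerpt 1. The point is only that the reply is drawn without any reference to the interviewer's input.

```python
import random

# A few of PARRY's output statements, copied from Excerpt 1 for illustration.
PARRY_OUTPUTS = [
    "I AM UPSET",
    "PEOPLE MAKE ME NERVOUS",
    "DO YOU KNOW HOW THE UNDERWORLD OPERATES?",
    "COPS ARREST THE WRONG PEOPLE",
]


def random_parry(interviewer_input, rng):
    """Ignore the input entirely and return a randomly chosen output statement."""
    return rng.choice(PARRY_OUTPUTS)


rng = random.Random(1973)  # seeded only so that a run can be repeated
for question in ["HOW LONG HAVE YOU BEEN IN THE HOSPITAL?", "WHO BROUGHT YOU IN?"]:
    print("Dr. -", question)
    print("Pt. -", random_parry(question, rng))
```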
79700
79800 (INSERT TABLE 5 HERE)
79900
80000 Although a distinction is not made when the simple machine-
80100 question is asked, definite distinctions ARE made when judgements are
80200 requested along specific dimensions. As shown in Table 5,
80300 significant differences appear along the dimensions of linguistic
80400 noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
80500 rated higher. On these particular dimensions we can construct a
80600 continuum in which the random version represents one extreme, the
80700 actual patients another. Nonrandom PARRY lies somewhere between these
80800 two extremes, indicating that it performs significantly better than
80900 the random version but still requires improvement before it can be
81000 considered indistinguishable from patients relative to these
81100 dimensions. Table 6 presents t values for differences between mean
81200 ratings of PARRY and RANDOM-PARRY. (See Table 6 and Fig. 2 for the
81300 mean ratings).
81400
81500 (INSERT TABLE 6 AND FIG 2 HERE)
81600
81700 These studies show that a more useful way to use Turing-like
81800 indistinguishability tests is to ask expert judges to make ratings
81900 along multiple dimensions deemed essential to the model. Thus the
82000 model can serve as an instrument for its own perfection. A good
82100 validation procedure has criteria for better or worse approximations.
82200 Useful tests do not necessarily prove a model; they probe it for its
82300 strengths and weaknesses and clarify what is to be done next in the
82400 way of modification and repair. Simply asking the machine-question
82500 yields little information relevant to what the model builder most
82600 wants to know, namely, along which dimensions does the model need to
82700 be modified in order to effect an improvement in its performance?
82800
82900 To conclude, it is perhaps historically significant that
83000 these tests were conducted at all. To my knowledge, no one to date
83100 has subjected an interactive simulation model of human symbolic
83200 processes to multidimensional indistinguishability tests. These tests
83300 set a precedent and provide a standard against which competing models
83400 might be measured.